-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Interaction and Better Testing #71
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the PR, make stuff more clean and add a way to fetch the raw files.
Beside some small comments, I think that we should refactor the DES classes. Currently we have 4 files for DES:
DES370K, DES5M(DES370K), DES66 , DES66x8
I would move DES5M and DES370K classes in a single DES file.
I would do the same with DES66 and DES66x8 into a DES66 file and make one of them inherit the other one to have a more complex type that we could use in case.
More importantly. I think there is an error in the DES5M read_raw_entries(self)
method.
Currently the class is a child of DES370K:
def read_raw_entries(self) -> List[Dict]:
return DES5M._read_raw_entries()
This seems wrong as it seems to me that it is recurrently calling itself. It should be:
def read_raw_entries(self) -> List[Dict]:
return super()._read_raw_entries()
or
def read_raw_entries(self) -> List[Dict]:
return DES370K._read_raw_entries()
I don't have thoroughly tested it currently. I'll try to run/debug the classes better sunday or monday. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a couple minor comments, feel free to address those, but otherwise looks good I think!
Interaction dataset refactoring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interaction dataset module is getting cleaner and I quite like it. I know there is some issues on naming, restraining the data keys to property methods but I think overall this kind of inheritance will help us along the way as the preprocess methods should be hardly touched at all and if we need to touch them, it means that we did something wrong at the beginning.
Nonetheless, there are some issues on changing the keys (+ some name kerfuffles) that requires reproprecessing the datasets and repushing them to the cloud. Currently because of that I m unable to give you a more in depth review as my testing can be quite limited.
Side note:
- We should stop calling
__name__
directly and provide a propertyname
that sanitize the name to avoid problems - We really need more solid tests
As soon as we fix the main issues, I can do a round of testing for a final approval if you think it is needed. This should be a good time to actually clean the push methods as we currently are forced to always push to cloud and it is a not intended behaviour
@@ -61,7 +35,7 @@ def __getitem__(self, idx: int): | |||
) | |||
name = self.__smiles_converter__(self.data["name"][idx]) | |||
subset = self.data["subset"][idx] | |||
n_atoms_first = self.data["n_atoms_first"][idx] | |||
n_atoms_ptr = self.data["n_atoms_ptr"][idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the datasets are currently using the n_atom_first keyword that is processed in the read_raw_entries . We need to recompute and push the data with this new keywork
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we need to recompute and push, before merging these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently trying to load any interaction dataset will get you an error due to the:
if not self.is_preprocessed() failing due to the naming.
In the bucket they were written L7 and X40 (upper case). We should always have the sanitize name on lower case. As we need to postprocess it again to have the new keys. It will fix itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, need to push new changes.
"name": str, | ||
"subset": str, | ||
"n_atoms": np.int32, | ||
"n_atoms_ptr": np.int32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the n_atoms_ptr
will make the read_preprocess
fail for the interaction datasets because we currently don't have that key inside (see my other comments).
More in depth, Line 393: assert all([key in all_pkl_keys for key in self.pkl_data_keys])
will give you :
['name', 'subset', 'n_atoms', 'n_atoms_ptr']
[True, True, True, False]
@@ -78,68 +52,18 @@ def __getitem__(self, idx: int): | |||
name=name, | |||
subset=subset, | |||
forces=forces, | |||
n_atoms_first=n_atoms_first, | |||
n_atoms_ptr=n_atoms_ptr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as before
openqdc/datasets/interaction/des.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to change n_atom_first to n_atom_ptr and preprocess it again + push
openqdc/datasets/interaction/l7.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to change n_atom_first to n_atom_ptr and preprocess it again + push
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to change n_atom_first to n_atom_ptr and preprocess it again + push
openqdc/datasets/interaction/x40.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to change n_atom_first to n_atom_ptr and preprocess it again + push
Adressed this right now in the https://github.com/OpenDrugDiscovery/openQDC/tree/qol |
As I preprocessed and merged this branch in QoL, I m going to close it |
Fixes #63
💯 Shoutout for already existing code from Danny which made working on this super easy 💯
Dataset Preprocessing
read_raw_entries
for both L7 and X40read_raw_entries()
Simplified the Interaction Dataset class by using
BaseInteractionDataset
to enforcen_atoms_first
is present in thedata_dict
x[x==None] = -1
done in save_preprocess by doing the changes inread_raw_entries
. For now I noticed only Splinter doing this.Testing
Feel free to drop suggestions @prtos @mcneela @FNTwin
Checklist: